e-Appendix B 


Linear Algebra 


The basic storage structure for the data is a matrix X which has N rows, one 
for each data point; each data point is a row-vector. 


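\[
X = \begin{bmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_N^T \end{bmatrix}
  = \begin{bmatrix}
      x_{1,1} & x_{1,2} & \cdots & x_{1,d} \\
      x_{2,1} & x_{2,2} & \cdots & x_{2,d} \\
      \vdots  & \vdots  &        & \vdots  \\
      x_{N,1} & x_{N,2} & \cdots & x_{N,d}
    \end{bmatrix}
\]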


When we write $X \in \mathbb{R}^{N \times d}$ we mean a matrix like the one above, which in this
case is $N \times d$ ($N$ rows and $d$ columns). Linear algebra plays an important role
in manipulating such matrices when learning from data, because the matrix X 
can be viewed as a linear operator that takes a set of weights w and outputs 
in-sample predictions: 

y = Xw. 
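
The view of X as a linear operator can be sketched numerically. The following is a minimal illustration using NumPy; the data matrix and the weights are made-up values chosen only for this example.

```python
import numpy as np

# Hypothetical data matrix: N = 3 data points (rows), d = 2 features (columns).
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# Illustrative weight vector in R^d (an assumption, not from the text).
w = np.array([0.5, -1.0])

# In-sample predictions: one number per data point (per row of X).
y = X @ w

# Each prediction is the inner product of a data point with w.
y_rowwise = np.array([row @ w for row in X])
```

Each entry of y is the inner product of the corresponding row of X with w, which is why the two computations agree.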


B.1 Basic Properties of Vectors and Matrices 


A (column) vector $\mathbf{v} \in \mathbb{R}^d$ has $d$ components $v_1, \ldots, v_d$. The matrix $X$ above
could be viewed as a set of d column vectors or a set of N row vectors. A 
vector can be multiplied by a scalar in the usual way, by multiplying each 
component by the scalar, and two vectors can be added together by adding 
the respective components together. 

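The vectors $\mathbf{v}_1, \ldots, \mathbf{v}_m \in \mathbb{R}^d$ are linearly independent if no non-trivial
linear combination of them can equal $\mathbf{0}$:

\[
\sum_{i=1}^{m} \alpha_i \mathbf{v}_i = \mathbf{0} \implies \alpha_i = 0 \quad \text{for } i = 1, \ldots, m.
\]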


If the vectors are not linearly independent, then they are linearly dependent. 
Given two vectors $\mathbf{v}, \mathbf{u}$, the standard Euclidean inner product (dot product)
is $\mathbf{v}^T\mathbf{u} = \sum_{i=1}^{d} v_i u_i$ and the Euclidean norm is
$\|\mathbf{v}\|^2 = \mathbf{v}^T\mathbf{v} = \sum_{i=1}^{d} v_i^2$. Let $\theta$
be the angle between $\mathbf{v}$ and $\mathbf{u}$ in the standard geometric sense. Then

\[
\mathbf{v}^T\mathbf{u} = \|\mathbf{v}\|\,\|\mathbf{u}\|\cos\theta
\implies (\mathbf{v}^T\mathbf{u})^2 \le \|\mathbf{v}\|^2\,\|\mathbf{u}\|^2,
\]


where the latter inequality is known as the Cauchy-Schwarz inequality. The 


two vectors $\mathbf{v}$ and $\mathbf{u}$ are orthogonal if $\mathbf{v}^T\mathbf{u} = 0$; if in addition $\|\mathbf{v}\| = \|\mathbf{u}\| = 1$,
then the two vectors are orthonormal (orthogonal and have unit norm). 
A basis $\mathbf{v}_1, \ldots, \mathbf{v}_d$ has the property that any $\mathbf{u} \in \mathbb{R}^d$ can be written as a
unique linear combination of the basis vectors, $\mathbf{u} = \sum_{i=1}^{d} \alpha_i \mathbf{v}_i$. It follows that
the basis must be linearly independent. (We have implicitly assumed that the
cardinality of any basis is $d$, which is indeed the case.) A basis $\mathbf{v}_1, \ldots, \mathbf{v}_d$ is
an orthonormal basis if $\mathbf{v}_i^T\mathbf{v}_j = \delta_{ij}$ (equal to 1 if $i = j$ and zero otherwise),
that is, the basis vectors are pairwise orthonormal.


Exercise B.1 
This exercise introduces some fundamental properties of vectors and bases. 


(a) Are the following sets of vectors dependent or independent? 


(ED tbl Gl) E tlbi 


(b) Which of the sets in (a) span R?? 


(c) Show that the expansion $\mathbf{u} = \sum_{i=1}^{d} \alpha_i \mathbf{v}_i$ holds for unique $\alpha_i$ if and
only if $\mathbf{v}_1, \ldots, \mathbf{v}_d$ are independent.


(d) Which of the sets in (a) are a basis for R°? 
(e) Show that any set of vectors containing the zero vector is dependent. 


(f) Show: if $\mathbf{v}_1, \ldots, \mathbf{v}_m \in \mathbb{R}^d$ are independent then $m \le d$ (the maximum
cardinality of an independent set is the dimension $d$). [Hint:
Induction on $d$.]


(g) Show: if $\mathbf{v}_1, \ldots, \mathbf{v}_m$ span $\mathbb{R}^d$, then $m \ge d$. [Hint: The span does not
change if you add a multiple of one of the vectors to another. Hence, 
transform the set to a more convenient one.] 


(h) Show: every basis has cardinality equal to the dimension d. 


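(i) Show: If $\mathbf{v}_1, \mathbf{v}_2$ are orthogonal, they are independent. If $\mathbf{v}_1, \mathbf{v}_2$ are
independent, then $\mathbf{v}_1, \mathbf{v}_2 - \lambda\mathbf{v}_1$ have the same span and are orthogonal,
where $\lambda = \mathbf{v}_1^T\mathbf{v}_2 / \mathbf{v}_1^T\mathbf{v}_1$. What if $\mathbf{v}_1, \mathbf{v}_2$ are dependent?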


(j) Show that any set of independent vectors can be transformed to a set 
of pairwise orthogonal vectors with the same span. 


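(k) Given a basis $\mathbf{v}_1, \ldots, \mathbf{v}_d$ show how to construct an orthonormal basis
$\hat{\mathbf{v}}_1, \ldots, \hat{\mathbf{v}}_d$. [Hint: Start by making $\mathbf{v}_2, \ldots, \mathbf{v}_d$ orthogonal to $\mathbf{v}_1$.]

(l) If $\mathbf{v}_1, \ldots, \mathbf{v}_d$ is an orthonormal basis, then show that any vector $\mathbf{u}$
has the (unique) expansion $\mathbf{u} = \sum_{i=1}^{d} (\mathbf{u}^T\mathbf{v}_i)\,\mathbf{v}_i$.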


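(m) Show: Any set of $d$ linearly independent vectors is a basis for $\mathbb{R}^d$.

If $\mathbf{v}_1, \ldots, \mathbf{v}_d$ is an orthonormal basis, then the coefficients $(\mathbf{u}^T\mathbf{v}_i)$ in the expansion
$\mathbf{u} = \sum_{i=1}^{d} (\mathbf{u}^T\mathbf{v}_i)\,\mathbf{v}_i$ are the coordinates of $\mathbf{u}$ with respect to the (ordered)
basis $\mathbf{v}_1, \ldots, \mathbf{v}_d$. These $d$ coordinates form a vector in $d$ dimensions. When we
write $\mathbf{u}$ as a vector of its coordinates, as we have been doing, we are implicitly
assuming that these coordinates are with respect to the standard orthonormal
basis $\mathbf{e}_1, \ldots, \mathbf{e}_d$:

\[
\mathbf{e}_1 = \begin{bmatrix} 1 \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix},\quad
\mathbf{e}_2 = \begin{bmatrix} 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix},\quad
\mathbf{e}_3 = \begin{bmatrix} 0 \\ 0 \\ 1 \\ \vdots \\ 0 \end{bmatrix},\quad
\ldots,\quad
\mathbf{e}_d = \begin{bmatrix} 0 \\ 0 \\ 0 \\ \vdots \\ 1 \end{bmatrix}.
\]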


Matrices. Let $A \in \mathbb{R}^{N \times d}$ ($N \times d$ matrix) and $B, C \in \mathbb{R}^{d \times M}$ be arbitrary
real valued matrices. For the $(i, j)$-th entry of a matrix, we may write $A_{ij}$
or $[A]_{ij}$ (the latter when we want to explicitly identify $A$ as a matrix). The
transpose $A^T \in \mathbb{R}^{d \times N}$ is a matrix whose entries are given by $[A^T]_{ij} = [A]_{ji}$.
The matrix-vector and matrix-matrix products are defined in the usual way,

\[
[A\mathbf{x}]_i = \sum_{j=1}^{d} A_{ij} x_j, \quad \text{where } \mathbf{x} \in \mathbb{R}^d,\ A \in \mathbb{R}^{N \times d} \text{ and } A\mathbf{x} \in \mathbb{R}^N;
\]
\[
[AB]_{ij} = \sum_{k=1}^{d} A_{ik} B_{kj}, \quad \text{where } A \in \mathbb{R}^{N \times d},\ B \in \mathbb{R}^{d \times M} \text{ and } AB \in \mathbb{R}^{N \times M}.
\]

In general, when we refer to products of matrices below, assume that all the
products exist. Note that $A(B + C) = AB + AC$ and $(AB)^T = B^TA^T$. The $d \times d$
identity matrix $I_d$ is the matrix whose columns are the standard basis, having
diagonal entries 1 and zeros elsewhere. If $A$ is a square matrix ($d = N$), the
inverse $A^{-1}$ (if it exists) satisfies

\[
AA^{-1} = A^{-1}A = I_d.
\]

Note that

\[
(AB)^{-1} = B^{-1}A^{-1}.
\]


A matrix is invertible if and only if its columns (also rows) are linearly inde- 
pendent (in which case the columns are a basis). 

If the matrix $A$ has orthonormal columns, then $A^TA = I_d$; if in addition
$A$ is square, then it is orthogonal (and invertible). It is often convenient to
refer to the columns of a matrix, and we will write $A = [\mathbf{a}_1, \ldots, \mathbf{a}_d]$. Using
this notation, one can write a matrix vector product as

\[
A\mathbf{x} = \sum_{i=1}^{d} x_i \mathbf{a}_i,
\]

from which we see that a matrix can be viewed as an operator that takes
the input $\mathbf{x}$ and transforms it into a linear combination of its columns. The
range of a matrix $A$ is the subspace spanned by its columns, $\mathrm{range}(A) =
\mathrm{span}(\{\mathbf{a}_1, \ldots, \mathbf{a}_d\})$. It is also useful to define the subspace spanned by the
rows of $A$, which is the range of $A^T$. The dimension of the range of $A$ is called
the column-rank of $A$. Similarly, the dimension of the subspace spanned by
the rows is the row-rank of $A$. It is a useful fact that the row and column
ranks are equal, and so we can define the rank of a matrix, denoted $\rho$, as the
dimension of its range. Note that

\[
\rho(A) = \mathrm{rank}(A) = \mathrm{rank}(A^TA) = \mathrm{rank}(AA^T).
\]


The matrix $A \in \mathbb{R}^{N \times d}$ has full column rank if $\mathrm{rank}(A) = d$; it has full row
rank if $\mathrm{rank}(A) = N$. Note that $\mathrm{rank}(A) \le \min(N, d)$, since the dimension of
a space spanned by $\ell$ vectors is at most $\ell$ (see Exercise B.1(g)).
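
The column view of the product and the rank identities above can be checked on a small example. This is a sketch using NumPy; the matrix A (whose third column is deliberately the sum of the first two) and the vector x are illustrative assumptions.

```python
import numpy as np

# Hypothetical 4 x 3 matrix; the third column equals the sum of the first two,
# so only two columns are linearly independent.
A = np.array([[1.0, 2.0, 3.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [2.0, 1.0, 3.0]])
x = np.array([2.0, -1.0, 0.5])

# Ax computed directly and as a linear combination of the columns of A.
as_product = A @ x
as_combination = sum(x[i] * A[:, i] for i in range(A.shape[1]))

# rank(A) = rank(A^T A) = rank(A A^T), computed numerically.
rank_A = np.linalg.matrix_rank(A)
rank_AtA = np.linalg.matrix_rank(A.T @ A)
rank_AAt = np.linalg.matrix_rank(A @ A.T)
```

Note that `np.linalg.matrix_rank` decides the rank from the singular values with a numerical tolerance, so for badly conditioned matrices the computed rank may differ from the exact one.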


Exercise B.2 


i 2 @ i 2 1 

Lee AS 12 il ©], 1e= fil aSa 

0 0 3 2 0 3 

(a) Compute (if the quantities exist): $AB$; $A\mathbf{x}$; $B\mathbf{x}$; $BA$; $B^TA^T$; $\mathbf{x}^TA\mathbf{x}$;
BAB; $A^{-1}$.

(b) Show that $\mathbf{x}^TA\mathbf{x} = \sum_{i=1}^{3}\sum_{j=1}^{3} x_i x_j A_{ij}$.


(c) Find a $\mathbf{v}$ for which $A\mathbf{v} = \lambda\mathbf{v}$. What are the possible choices of $\lambda$?


(d) How many linearly independent rows are there in B? How many 
linearly independent columns are there in B? 


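(e) Given any matrix $A = [\mathbf{a}_1, \ldots, \mathbf{a}_d]$, let $\mathbf{c}_1, \ldots, \mathbf{c}_r$ be a basis for the
span of $\mathbf{a}_1, \ldots, \mathbf{a}_d$. Let $C$ be the matrix whose columns are this basis,
$C = [\mathbf{c}_1, \ldots, \mathbf{c}_r]$. Show that for some matrix $R$, one can write
$A = CR$.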


(i) What is the column-rank of A? 
(ii) What are the dimensions of R? 
(iii) Show that every row in A is a linear combination of rows in R. 


(iv) Hence, show that the dimension of the subspace spanned by the 
rows of A is at most r, the column-rank. That is, 


\[
\text{column-rank}(A) \ge \text{row-rank}(A).
\]

(v) Show that $\text{column-rank}(A) \le \text{row-rank}(A)$, and hence that the
column-rank equals the row-rank. [Hint: Consider $A^T$.]


B.2 SVD and Pseudo-Inverse 


The singular value decomposition (SVD) of a matrix $A$ is one of the most
useful matrix decompositions. For the matrix $A$, assume $d \le N$ and the rank
$\rho \le d$. The SVD of $A$ factorizes $A$ into the product of three special matrices:

\[
A = U\Gamma V^T.
\]

The matrix $U \in \mathbb{R}^{N \times \rho}$ has orthonormal columns that are called the left
singular vectors of $A$; the matrix $V \in \mathbb{R}^{d \times \rho}$ has orthonormal columns that are
called right singular vectors of $A$. So,

\[
U^TU = V^TV = I_\rho.
\]

The matrix $\Gamma \in \mathbb{R}^{\rho \times \rho}$ is diagonal and has as its diagonal entries the (positive)
singular values of $A$, $\gamma_i = [\Gamma]_{ii}$. Typically, the singular values are ordered,
so that $\gamma_1 \ge \cdots \ge \gamma_\rho$; in this case, the first column of $U$ is called the top
left singular vector, and similarly the first column of $V$ is called the top right
singular vector. The condition number of $A$ is the ratio of the largest to
smallest singular values: $\kappa = \gamma_1/\gamma_\rho$, which plays an important role in the
stability of algorithms involving matrices, such as solving the linear regression
problem or inverting the matrix. Algorithms exist to compute the SVD in
$O(Nd\min(N, d))$ time. If only a few of the top singular vectors and singular
values are needed, these can be obtained more efficiently using iterative
subspace methods such as power iteration.
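
As a rough numerical illustration of the factorization, the sketch below uses NumPy's `np.linalg.svd` on a small made-up matrix with full column rank. With `full_matrices=False` NumPy returns min(N, d) singular values, including any zeros; this matches the rank-ρ form described above only when the matrix has full rank, as it does in this example.

```python
import numpy as np

# Hypothetical 3 x 2 matrix with full column rank (rho = d = 2).
A = np.array([[2.0, 0.0],
              [1.0, 1.0],
              [0.0, 2.0]])

# Reduced SVD: U is N x d, gamma holds the singular values (descending),
# and Vt is V^T.
U, gamma, Vt = np.linalg.svd(A, full_matrices=False)
Gamma = np.diag(gamma)

# Reconstruction A = U Gamma V^T and the orthonormality of the columns.
A_rebuilt = U @ Gamma @ Vt
UtU = U.T @ U
VtV = Vt @ Vt.T

# Condition number: ratio of largest to smallest singular value.
kappa = gamma[0] / gamma[-1]
```

For this matrix A^T A has eigenvalues 6 and 4, so the singular values are the square roots of those numbers.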

If $A$ is not invertible (for example, not square), it is useful to define the
Moore-Penrose pseudo-inverse $A^\dagger$, which satisfies four properties:

\[
\text{(i) } AA^\dagger A = A;\quad
\text{(ii) } A^\dagger AA^\dagger = A^\dagger;\quad
\text{(iii) } (AA^\dagger)^T = AA^\dagger;\quad
\text{(iv) } (A^\dagger A)^T = A^\dagger A.
\]

The pseudo-inverse functions as an inverse and plays an important role in
linear regression.
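
The four defining properties can be checked numerically. The sketch below assumes NumPy's `np.linalg.pinv` as the pseudo-inverse routine and uses made-up values for A and y; the comparison with `np.linalg.lstsq` illustrates the link to least-squares regression.

```python
import numpy as np

# Hypothetical non-square matrix (N = 3, d = 2), so no ordinary inverse exists.
A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
A_pinv = np.linalg.pinv(A)

# The four Moore-Penrose properties, checked up to floating point tolerance.
checks = {
    "(i)":   np.allclose(A @ A_pinv @ A, A),
    "(ii)":  np.allclose(A_pinv @ A @ A_pinv, A_pinv),
    "(iii)": np.allclose((A @ A_pinv).T, A @ A_pinv),
    "(iv)":  np.allclose((A_pinv @ A).T, A_pinv @ A),
}

# Least-squares weights for illustrative targets y, via the pseudo-inverse
# and via NumPy's least-squares solver.
y = np.array([1.0, 2.0, 2.0])
w_pinv = A_pinv @ y
w_lstsq = np.linalg.lstsq(A, y, rcond=None)[0]
```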


Exercise B.3 
If the SVD of $A$ is $A = U\Gamma V^T$, by checking all four properties, verify that

\[
A^\dagger = V\Gamma^{-1}U^T
\]

is a pseudo-inverse of $A$. Also show that $(A^T)^\dagger = (A^\dagger)^T$.


If $A$ and $B$ both have rank $d$ or if one of $A$, $B^T$ has orthonormal columns, then
$(AB)^\dagger = B^\dagger A^\dagger$. The matrix $\Pi_A = AA^\dagger = UU^T$ is the projection operator
onto the range (column-space) of $A$; this projection operator can be used to
decompose a vector $\mathbf{x}$ into two orthogonal parts, the part $\mathbf{x}_A$ in the range of
$A$ and the part $\mathbf{x}_{A^\perp}$ orthogonal to the range of $A$:

\[
\mathbf{x} = \underbrace{\Pi_A\mathbf{x}}_{\mathbf{x}_A} + \underbrace{(I - \Pi_A)\mathbf{x}}_{\mathbf{x}_{A^\perp}}.
\]

A projection operator satisfies $\Pi^2 = \Pi$. For example $\Pi_A^2 = (AA^\dagger A)A^\dagger = AA^\dagger = \Pi_A$.
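
The decomposition of a vector into its part in the range of A and its orthogonal remainder can be sketched as follows. The matrix A and the vector x are made-up values, and NumPy's `np.linalg.pinv` and `np.linalg.svd` are used as stand-ins for the pseudo-inverse and SVD of the text.

```python
import numpy as np

# Hypothetical 3 x 2 matrix with full column rank.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])

# Projection operator onto range(A), built two ways.
P = A @ np.linalg.pinv(A)
U, _, _ = np.linalg.svd(A, full_matrices=False)
P_from_svd = U @ U.T

# Split an illustrative vector into the part inside range(A)
# and the part orthogonal to range(A).
x = np.array([1.0, 0.0, 2.0])
x_in_range = P @ x
x_orthogonal = (np.eye(3) - P) @ x
```

The orthogonal part has zero inner product with every column of A, which is what makes the two pieces an orthogonal decomposition.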
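B.3 Symmetric Matrices


A square matrix is symmetric if $A = A^T$ (anti-symmetric if $A = -A^T$). Symmetric
matrices play a special role because the covariance matrix of the data
is symmetric. If $X$ is the data matrix, then the centered data matrix $X_{\mathrm{cen}}$ is
obtained by subtracting from each data point the mean vector

\[
\boldsymbol{\mu} = \frac{1}{N}\sum_{n=1}^{N} \mathbf{x}_n = \frac{1}{N} X^T\mathbf{1},
\]

where $\mathbf{1}$ is the $N$-dimensional vector of ones. In matrix form,

\[
X_{\mathrm{cen}} = X - \mathbf{1}\boldsymbol{\mu}^T
= X - \frac{1}{N}\mathbf{1}\mathbf{1}^TX
= \left(I - \frac{1}{N}\mathbf{1}\mathbf{1}^T\right)X.
\]

From this expression, it is easy to see why $\left(I - \frac{1}{N}\mathbf{1}\mathbf{1}^T\right)$ is called the centering
operator. The covariance matrix $\Sigma$ is given by

\[
\Sigma = \frac{1}{N}\sum_{n=1}^{N} (\mathbf{x}_n - \boldsymbol{\mu})(\mathbf{x}_n - \boldsymbol{\mu})^T = \frac{1}{N} X_{\mathrm{cen}}^T X_{\mathrm{cen}}.
\]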
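
The centering operator and the covariance formula can be sketched numerically. The data matrix below is a made-up example; the comparison against `np.cov` uses `bias=True` so that NumPy normalizes by N, as the formula above does.

```python
import numpy as np

# Hypothetical data matrix: N = 4 data points (rows), d = 2 features.
X = np.array([[1.0, 2.0],
              [3.0, 6.0],
              [5.0, 1.0],
              [7.0, 3.0]])
N = X.shape[0]
ones = np.ones(N)

# Mean vector mu = (1/N) X^T 1.
mu = X.T @ ones / N

# Centering operator (I - (1/N) 1 1^T) applied to X.
centering = np.eye(N) - np.outer(ones, ones) / N
X_cen = centering @ X

# Covariance matrix Sigma = (1/N) X_cen^T X_cen.
Sigma = X_cen.T @ X_cen / N
```

Forming the N x N centering matrix is only for illustration; subtracting the column means directly (`X - X.mean(axis=0)`) gives the same result without building it.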


Via the decomposition $A = \frac{1}{2}(A + A^T) + \frac{1}{2}(A - A^T)$, every square matrix can be
decomposed into the sum of its symmetric part and its anti-symmetric part.
Analogous to the SVD, the spectral theorem says that any symmetric matrix
admits a spectral or eigen-decomposition of the form

\[
A = U\Lambda U^T,
\]

where $U$ has orthonormal columns that are the eigenvectors of $A$, so $U^TU = I_\rho$,
and $\Lambda$ is diagonal with entries $\lambda_i = [\Lambda]_{ii}$ that are the eigenvalues of $A$. Each
column $\mathbf{u}_i$ of $U$ for $i = 1, \ldots, \rho$ is an eigenvector of $A$ with eigenvalue $\lambda_i$
(the number of non-zero eigenvalues is equal to the rank). Via the identity
$AU = U\Lambda$, one can verify that the eigenvalues and corresponding eigenvectors
satisfy the relation

\[
A\mathbf{u}_i = \lambda_i\mathbf{u}_i.
\]

All eigenvalues of a symmetric matrix are real. Eigenvalues and eigenvectors
can be defined more generally using the above equation (even for
non-symmetric matrices), but we will only be concerned with symmetric matrices.
If the $\lambda_i$ are ordered, with $\lambda_1 \ge \cdots \ge \lambda_\rho$, then $\lambda_1$ is the top eigenvalue, with
corresponding eigenvector $\mathbf{u}_1$, and so on. Note that the eigen-decomposition
is similar to the SVD ($A = U\Gamma V^T$) with the flexibility to have negative
entries along the diagonal in $\Lambda$. Since $AA^T = U\Lambda^2U^T = U\Gamma^2U^T$, one identifies
that $\lambda_i^2 = \gamma_i^2$. That is, up to a sign, the eigenvalues and singular values of a
symmetric matrix are the same.
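
The relation between eigenvalues and singular values of a symmetric matrix can be checked on a small example. This sketch assumes NumPy's `np.linalg.eigh` (which is intended for symmetric matrices and returns eigenvalues in ascending order); the matrix is a made-up example with one negative eigenvalue.

```python
import numpy as np

# Hypothetical symmetric matrix with one positive and one negative eigenvalue.
A = np.array([[2.0, 1.0],
              [1.0, -3.0]])

# Eigen-decomposition A = U Lambda U^T (eigenvalues returned in ascending order).
lam, U = np.linalg.eigh(A)
A_rebuilt = U @ np.diag(lam) @ U.T

# Column-wise check of A u_i = lambda_i u_i (U * lam scales column i by lam[i]).
AU = A @ U
U_Lambda = U * lam

# Singular values (descending) versus absolute eigenvalues (sorted descending).
gamma = np.linalg.svd(A, compute_uv=False)
abs_lam_sorted = np.sort(np.abs(lam))[::-1]
```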


B.3.1 Positive Semi-Definite Matrices 


The matrix $A$ is symmetric positive semi-definite (SPSD) if it is symmetric
and for every non-zero $\mathbf{x} \in \mathbb{R}^N$,

\[
\mathbf{x}^TA\mathbf{x} \ge 0.
\]

One writes $A \succeq 0$; if for every non-zero $\mathbf{x}$,

\[
\mathbf{x}^TA\mathbf{x} > 0,
\]

then $A$ is positive definite (PD) and one writes $A \succ 0$. If $A$ is positive definite,
then $\lambda_i > 0$ (all its eigenvalues are positive), in which case $\lambda_i = \gamma_i$ and the
eigen-decomposition and the SVD are identical. We can write $\Lambda = S^2$ and
identify $A = US(US)^T$ from which one deduces that every SPSD matrix has
the form $A = ZZ^T$. The covariance matrix is an example of an SPSD matrix.


B.4 Trace and Determinant 


The trace and determinant are defined for a square matrix $A$ (so $N = d$). The
trace is the sum of the diagonal elements,

\[
\mathrm{trace}(A) = \sum_{i=1}^{N} A_{ii}.
\]

The trace operator is cyclic:

\[
\mathrm{trace}(AB) = \mathrm{trace}(BA)
\]

(when both products are defined). From the cyclic property, if $O$ has orthonormal
columns (so $O^TO = I$), then $\mathrm{trace}(OAO^T) = \mathrm{trace}(A)$. Setting $O = U$
from the eigen-decomposition of $A$,

\[
\mathrm{trace}(A) = \mathrm{trace}(\Lambda) = \sum_{i=1}^{N} \lambda_i
\]

(sum of eigenvalues). Note that $\mathrm{trace}(I_k) = k$.

The determinant of $A$, written $|A|$, is most easily defined using the totally
anti-symmetric Levi-Civita symbol $\varepsilon_{i_1,\ldots,i_N}$ (if you swap any two indices then
the value negates, and $\varepsilon_{1,2,\ldots,N} = 1$; if any index is repeated, the value is zero):

\[
|A| = \sum_{i_1,\ldots,i_N=1}^{N} \varepsilon_{i_1,\ldots,i_N}\, A_{1i_1} \cdots A_{Ni_N}.
\]

Since every summand has exactly one term from each column, if you scale
any column, the determinant is correspondingly scaled. The anti-symmetry
of $\varepsilon_{i_1,\ldots,i_N}$ implies that if you swap any two columns (or rows) of $A$, then
the determinant changes sign. If any two columns are identical, by swapping
them, the determinant must change sign, and hence the determinant must be
zero. If you add a vector to a column, then the determinant becomes a sum,

\[
\bigl|[\mathbf{a}_1, \ldots, \mathbf{a}_j + \mathbf{v}, \ldots, \mathbf{a}_N]\bigr|
= \bigl|[\mathbf{a}_1, \ldots, \mathbf{a}_j, \ldots, \mathbf{a}_N]\bigr|
+ \bigl|[\mathbf{a}_1, \ldots, \mathbf{v}, \ldots, \mathbf{a}_N]\bigr|.
\]


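If $\mathbf{v}$ is a multiple of another column then the second term vanishes, and so adding a
multiple of one column to another column does not change the determinant.
By separating out the summation with respect to $i_1$ in the definition of the
determinant (the first row of $A$), we get a useful recursive formula for the
determinant known as its expansion by cofactors. We illustrate for a $3 \times 3$
matrix below:

\[
\begin{vmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{vmatrix}
= A_{11}\begin{vmatrix} A_{22} & A_{23} \\ A_{32} & A_{33} \end{vmatrix}
- A_{12}\begin{vmatrix} A_{21} & A_{23} \\ A_{31} & A_{33} \end{vmatrix}
+ A_{13}\begin{vmatrix} A_{21} & A_{22} \\ A_{31} & A_{32} \end{vmatrix}
\]
\[
= A_{11}(A_{22}A_{33} - A_{23}A_{32}) - A_{12}(A_{21}A_{33} - A_{23}A_{31})
+ A_{13}(A_{21}A_{32} - A_{22}A_{31}).
\]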

(Notice the alternating signs as we expand along the first row.) Geometrically,
the determinant of $A$ equals the volume enclosed by the parallelepiped whose
sides are the column vectors $\mathbf{a}_1, \ldots, \mathbf{a}_N$. This geometric fact is important when
changing variables in a multi-dimensional integral, because the infinitesimal
volume element transforms according to the determinant of the Jacobian. A
useful fact about the determinant of a product of square matrices is

\[
|AB| = |A|\,|B|.
\]
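
The expansion by cofactors along the first row translates directly into a short recursive routine. The sketch below is for illustration only (its cost grows like N!, so it is not how determinants are computed in practice); the function name `det_cofactor` and the matrices are assumptions made for this example, and `np.linalg.det` serves as a reference.

```python
import numpy as np

def det_cofactor(A):
    """Determinant by cofactor expansion along the first row."""
    n = A.shape[0]
    if n == 1:
        return A[0, 0]
    total = 0.0
    for j in range(n):
        # Minor: delete the first row and the j-th column.
        minor = np.delete(np.delete(A, 0, axis=0), j, axis=1)
        # Alternating signs +, -, +, ... along the first row.
        total += (-1) ** j * A[0, j] * det_cofactor(minor)
    return total

# Hypothetical 3 x 3 matrices.
A = np.array([[2.0, 0.0, 1.0],
              [1.0, 3.0, 2.0],
              [1.0, 1.0, 4.0]])
B = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 0.0],
              [3.0, 0.0, 2.0]])

det_A = det_cofactor(A)
det_B = det_cofactor(B)
det_AB = det_cofactor(A @ B)
```

For these matrices the product rule can be confirmed by comparing `det_AB` with `det_A * det_B`.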


Exercise B.4 
Show the following useful properties of determinants. 
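(a) $|I_N| = 1$; if $D$ is diagonal, then $|D| = \prod_{i=1}^{N} [D]_{ii}$.

(b) Show that
\[
|A| = \frac{1}{N!} \sum_{i_1,\ldots,i_N=1}^{N}\ \sum_{j_1,\ldots,j_N=1}^{N}
\varepsilon_{i_1,\ldots,i_N}\,\varepsilon_{j_1,\ldots,j_N}\, A_{i_1j_1} \cdots A_{i_Nj_N}.
\]

(c) Using (b), argue that $|A| = |A^T|$.

(d) If $O$ is orthogonal (square with orthonormal columns, i.e. $O^TO = I_N$),
then $|O| = \pm 1$ and $|OAO^T| = |A|$. [Hint: $|O^TO| = |O|\,|O^T|$.]

(e) $|A^{-1}| = 1/|A|$. [Hint: $|A^{-1}A| = |I_N|$.]

(f) For symmetric $A$,
\[
|A| = \prod_{i=1}^{N} \lambda_i \quad \text{(product of eigenvalues of } A\text{)}.
\]
[Hint: $A = U\Lambda U^T$, where $U$ is orthogonal.]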





© M Abu-Mostafa, Magdon-Ismail, Lin: Jan-2015 e-Chap:B-8 
B.4.1 Inverse and Determinant Identities


The inverse of the covariance matrix plays a role in many learning algorithms. 
Hence, it is important to be able to update the inverse efficiently for small 
changes in a matrix. If A and B are square and invertible, then the following 
useful identities are known as the Sherman-Morrison-Woodbury inversion and 
determinant formulae: 


\[
(A + XBY^T)^{-1} = A^{-1} - A^{-1}X\,(B^{-1} + Y^TA^{-1}X)^{-1}\,Y^TA^{-1};
\]
\[
|A + XBY^T| = |A|\,|B|\,|B^{-1} + Y^TA^{-1}X|.
\]


An important special case is the inverse and determinant updates due to a 
rank 1 update of a matrix. Setting X = x, B = 1 and Y = y, 


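\[
(A + \mathbf{x}\mathbf{y}^T)^{-1} = A^{-1} - \frac{A^{-1}\mathbf{x}\mathbf{y}^TA^{-1}}{1 + \mathbf{y}^TA^{-1}\mathbf{x}};
\]
\[
|A + \mathbf{x}\mathbf{y}^T| = |A|\,(1 + \mathbf{y}^TA^{-1}\mathbf{x}).
\]
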
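Setting $\mathbf{x} = \pm\mathbf{y}$ gives the updates for $A \pm \mathbf{x}\mathbf{x}^T$. If $A$ has a block representation,
we can get similar updates to the inverse and determinant. Let

\[
A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix},
\]

and define

\[
F_1 = A_{11} - A_{12}A_{22}^{-1}A_{21}, \qquad
F_2 = A_{22} - A_{21}A_{11}^{-1}A_{12}.
\]

Then,

\[
A^{-1} = \begin{bmatrix} F_1^{-1} & -A_{11}^{-1}A_{12}F_2^{-1} \\ -F_2^{-1}A_{21}A_{11}^{-1} & F_2^{-1} \end{bmatrix}
\quad\text{and}\quad
|A| = |A_{22}|\,|F_1| = |A_{11}|\,|F_2|.
\]

Again, an important special case is when

\[
A = \begin{bmatrix} X & \mathbf{b} \\ \mathbf{b}^T & c \end{bmatrix}.
\]

In this case,

\[
A^{-1} = \begin{bmatrix} X^{-1} & \mathbf{0} \\ \mathbf{0}^T & 0 \end{bmatrix}
+ \frac{1}{c - \mathbf{b}^TX^{-1}\mathbf{b}}
\begin{bmatrix} X^{-1}\mathbf{b}\mathbf{b}^TX^{-1} & -X^{-1}\mathbf{b} \\ -\mathbf{b}^TX^{-1} & 1 \end{bmatrix},
\]
\[
|A| = |X|\,(c - \mathbf{b}^TX^{-1}\mathbf{b}).
\]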
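
The rank-1 inverse and determinant updates from earlier in this section can be checked numerically. In the sketch below, A, x and y are made-up values, and the directly computed inverse and determinant of the updated matrix serve only as a reference; the point of the update formula is that it reuses an already available inverse of A.

```python
import numpy as np

# Hypothetical invertible matrix and rank-1 update vectors.
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
x = np.array([1.0, 0.0, 2.0])
y = np.array([0.5, 1.0, 0.0])

A_inv = np.linalg.inv(A)

# Common scalar 1 + y^T A^{-1} x (must be non-zero for the update to exist).
denom = 1.0 + y @ A_inv @ x

# (A + x y^T)^{-1} = A^{-1} - (A^{-1} x)(y^T A^{-1}) / denom
updated_inv = A_inv - np.outer(A_inv @ x, y @ A_inv) / denom

# |A + x y^T| = |A| * denom
updated_det = np.linalg.det(A) * denom

# Reference values computed from scratch.
A_new = A + np.outer(x, y)
reference_inv = np.linalg.inv(A_new)
reference_det = np.linalg.det(A_new)
```

Given the inverse of A, the update costs on the order of N^2 operations, compared with roughly N^3 for inverting the updated matrix from scratch.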
Exercise B.5


Use the formula for the determinant of a $2 \times 2$ block matrix to show
Sylvester's determinant theorem for matrices $A \in \mathbb{R}^{N \times d}$ and $B \in \mathbb{R}^{d \times N}$:

\[
|I_N + AB| = |I_d + BA|.
\]

[Hint: Consider $\begin{vmatrix} I_N & -A \\ B & I_d \end{vmatrix}$.]

Use Sylvester's theorem to show the Sherman-Morrison-Woodbury determinant
identity

\[
|A + XBY^T| = |A|\,|B|\,|B^{-1} + Y^TA^{-1}X|.
\]

[Hint: $A + XBY^T = A(I + A^{-1}XBY^T)$, and use Sylvester's theorem.]


B.5 Inner Products, Matrix and Vector Norms 


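For vectors $\mathbf{x}$ and $\mathbf{z}$, the inner product is a symmetric, bilinear positive definite
function usually denoted $\langle\mathbf{x}, \mathbf{z}\rangle$. The standard Euclidean inner product (dot
product) is an example:

\[
\langle\mathbf{x}, \mathbf{z}\rangle = \mathbf{x}^T\mathbf{z} = \sum_{i=1}^{d} x_i z_i.
\]

The inner-product induced norm of a vector $\mathbf{x}$ (the Euclidean norm) is

\[
\|\mathbf{x}\|^2 = \langle\mathbf{x}, \mathbf{x}\rangle = \mathbf{x}^T\mathbf{x} = \sum_{i=1}^{d} x_i^2,
\]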


where the latter two equalities are for the Euclidean inner product. The
Cauchy-Schwarz inequality bounds the inner-product by the induced norms,

\[
\langle\mathbf{x}, \mathbf{z}\rangle^2 \le \|\mathbf{x}\|^2\,\|\mathbf{z}\|^2,
\]

with equality if and only if $\mathbf{x}$ and $\mathbf{z}$ are linearly dependent. The special
case of the Cauchy-Schwarz inequality for the Euclidean inner product was
given earlier in the chapter, $(\mathbf{x}^T\mathbf{z})^2 = \left(\sum_{i=1}^{d} x_i z_i\right)^2 \le \sum_{i=1}^{d} x_i^2 \sum_{i=1}^{d} z_i^2$. The
Pythagorean theorem applies to the square of a sum of vectors, and can be
obtained by expanding $(\mathbf{x} + \mathbf{y})^T(\mathbf{x} + \mathbf{y})$ to obtain:

\[
\|\mathbf{x} + \mathbf{y}\|^2 = \|\mathbf{x}\|^2 + \|\mathbf{y}\|^2 + 2\mathbf{x}^T\mathbf{y}.
\]

If $\mathbf{x}$ is orthogonal to $\mathbf{y}$, then $\|\mathbf{x} + \mathbf{y}\|^2 = \|\mathbf{x}\|^2 + \|\mathbf{y}\|^2$, which is the familiar
Pythagorean theorem from geometry where $\mathbf{x} + \mathbf{y}$ is the hypotenuse of the right triangle
with sides given by the vectors $\mathbf{x}$ and $\mathbf{y}$.

Associated with any vector norm is the spectral (or operator) matrix norm
$\|A\|_2$, which measures how large a vector transformed by $A$ can get:

\[
\|A\|_2 = \max_{\|\mathbf{x}\| = 1} \|A\mathbf{x}\|.
\]

If $U$ has orthonormal columns then $\|U\mathbf{x}\| = \|\mathbf{x}\|$ for any $\mathbf{x}$. Analogous to the
Euclidean norm of a vector is the Frobenius matrix norm $\|A\|_F$ which sums
the squares of the entries:

\[
\|A\|_F^2 = \mathrm{trace}(AA^T) = \mathrm{trace}(A^TA) = \sum_{i=1}^{N}\sum_{j=1}^{d} A_{ij}^2.
\]

Exercise B.6
Using the SVD of $A$, $A = U\Gamma V^T$, show that

\[
\|A\|_2 = \gamma_1 \quad \text{(top singular value)};
\]
\[
\|A\|_F^2 = \sum_{i=1}^{\rho} \gamma_i^2 \quad \text{(sum of squared singular values)}.
\]

Note that

\[
\|A\|_{2,F} = \|A^T\|_{2,F};
\]
\[
\|A\|_2 \le \|A\|_F \le \sqrt{\rho}\,\|A\|_2,
\]

where $\rho = \mathrm{rank}(A)$. The matrix norms satisfy the triangle inequality and a
property known as submultiplicativity, which bounds the norm of a product,

\[
\|A + B\|_{2,F} \le \|A\|_{2,F} + \|B\|_{2,F};
\]
\[
\|AB\|_2 \le \|A\|_2\,\|B\|_2;
\]
\[
\|AB\|_F \le \|A\|_2\,\|B\|_F.
\]

The generalized Pythagorean theorem states that if $A^TB = 0$ or $AB^T = 0$ then

\[
\|A + B\|_F^2 = \|A\|_F^2 + \|B\|_F^2;
\]
\[
\max\{\|A\|_2^2, \|B\|_2^2\} \le \|A + B\|_2^2 \le \|A\|_2^2 + \|B\|_2^2.
\]
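
The relations between the two matrix norms and the singular values can be checked numerically. This sketch assumes NumPy's `np.linalg.norm` with `ord=2` for the spectral norm and `ord='fro'` for the Frobenius norm; the matrix is a made-up example.

```python
import numpy as np

# Hypothetical 3 x 2 matrix.
A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# Singular values (descending) and rank.
gamma = np.linalg.svd(A, compute_uv=False)
rho = np.linalg.matrix_rank(A)

# Spectral norm (largest singular value) and Frobenius norm.
spectral = np.linalg.norm(A, 2)
frobenius = np.linalg.norm(A, 'fro')

# Frobenius norm squared, computed three ways.
frob_sq_entries = np.sum(A ** 2)
frob_sq_trace = np.trace(A.T @ A)
frob_sq_singular = np.sum(gamma ** 2)
```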








B.5.1 Linear Hyperplanes and Quadratic Forms 
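A linear scalar function has the form

\[
\ell(\mathbf{x}) = w_0 + \mathbf{w}^T\mathbf{x},
\]

where $\mathbf{x}, \mathbf{w} \in \mathbb{R}^d$. A hyperplane in $d$ dimensions is defined by all points $\mathbf{x}$
satisfying $\ell(\mathbf{x}) = 0$. For a generic point $\mathbf{x}$, its geometric distance to the
hyperplane is given by

\[
\mathrm{distance}(\mathbf{x}, (w_0, \mathbf{w})) = \frac{|w_0 + \mathbf{w}^T\mathbf{x}|}{\|\mathbf{w}\|}.
\]

The vector $\mathbf{w}$ is normal to the hyperplane. A quadratic form $q(\mathbf{x})$ is a scalar
function

\[
q(\mathbf{x}) = w_0 + \mathbf{w}^T\mathbf{x} + \tfrac{1}{2}\mathbf{x}^TQ\mathbf{x}.
\]

When $Q$ is positive definite, the manifold $q(\mathbf{x}) = 0$ defines an ellipsoid in
$d$ dimensions (recall that $\mathbf{x} \in \mathbb{R}^d$).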


B.6 Vector and Matrix Calculus 


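Let $E(\mathbf{x})$ be a scalar function of a vector $\mathbf{x} \in \mathbb{R}^d$. The gradient $\nabla E(\mathbf{x})$ is a
$d$-dimensional vector function of $\mathbf{x}$ whose components are the partial derivatives;
the Hessian $H_E(\mathbf{x})$ is a $d \times d$ matrix function of $\mathbf{x}$ whose entries are the second
order partial derivatives:

\[
[\nabla E(\mathbf{x})]_i = \frac{\partial}{\partial x_i} E(\mathbf{x}); \qquad
[H_E(\mathbf{x})]_{ij} = \frac{\partial^2}{\partial x_i\,\partial x_j} E(\mathbf{x}).
\]

The gradient and Hessian of the linear and quadratic forms are (taking $Q$ symmetric):

\[
\text{linear form: } \nabla\ell(\mathbf{x}) = \mathbf{w}; \quad H_\ell(\mathbf{x}) = 0;
\]
\[
\text{quadratic form: } \nabla q(\mathbf{x}) = \mathbf{w} + Q\mathbf{x}; \quad H_q(\mathbf{x}) = Q.
\]

For the general quadratic term $\mathbf{x}^TA\mathbf{x}$, the gradient is

\[
\nabla(\mathbf{x}^TA\mathbf{x}) = (A + A^T)\mathbf{x}.
\]

A sufficient condition for $\mathbf{x}^*$ to be a local minimum of $E(\mathbf{x})$ is
that the gradient at $\mathbf{x}^*$ is zero and the Hessian at $\mathbf{x}^*$ is positive definite (a
necessary condition is that the gradient is zero and the Hessian is positive
semi-definite). The Taylor expansion of $E(\mathbf{x})$ around $\mathbf{x}_0$ up to second order terms is

\[
E(\mathbf{x}) = E(\mathbf{x}_0) + \nabla E(\mathbf{x}_0)^T(\mathbf{x} - \mathbf{x}_0)
+ \tfrac{1}{2}(\mathbf{x} - \mathbf{x}_0)^T H_E(\mathbf{x}_0)(\mathbf{x} - \mathbf{x}_0) + \cdots.
\]

If $\mathbf{x}_0$ is a local minimum, the gradient term vanishes and the Hessian term is
non-negative.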
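
The gradient formulas for the quadratic form and for the general quadratic term can be sanity-checked with finite differences. The sketch below is only a numerical check under made-up values of w0, w, Q and A; the helper `numerical_gradient` is a hypothetical name introduced for this example.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Central-difference approximation to the gradient of f at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (f(x + step) - f(x - step)) / (2 * h)
    return grad

# Hypothetical quadratic form q(x) = w0 + w^T x + (1/2) x^T Q x, Q symmetric PD.
w0 = 1.0
w = np.array([1.0, -2.0])
Q = np.array([[2.0, 0.5],
              [0.5, 1.0]])

def q(x):
    return w0 + w @ x + 0.5 * x @ Q @ x

# A non-symmetric matrix for the general quadratic term x^T A x.
A = np.array([[1.0, 3.0],
              [0.0, 2.0]])

x0 = np.array([0.3, -0.7])
grad_q_numeric = numerical_gradient(q, x0)
grad_q_formula = w + Q @ x0
grad_xAx_numeric = numerical_gradient(lambda x: x @ A @ x, x0)
grad_xAx_formula = (A + A.T) @ x0

# Stationary point: gradient w + Q x = 0. Since Q is positive definite,
# this point is a minimum of q.
x_star = -np.linalg.solve(Q, w)
```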


B.6.1 Multidimensional Integration 


If $\mathbf{z} = \mathbf{q}(\mathbf{x})$ is a vector function of $\mathbf{x}$, then the Jacobian matrix $J$ contains the
derivatives of the components of $\mathbf{z}$ with respect to the components of $\mathbf{x}$:

\[
[J]_{ij} = \frac{\partial z_i}{\partial x_j}.
\]

The Jacobian is important when performing a change of variables from $\mathbf{x}$ to $\mathbf{z}$
in a multidimensional integral, because it relates the volume element in
$\mathbf{x}$-space to the volume element in $\mathbf{z}$-space. Specifically, the volume elements are
related by

\[
d\mathbf{z} = |J|\,d\mathbf{x}.
\]

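As an example of using the Jacobian to transform variables in an integral,
consider the multidimensional Gaussian distribution

\[
P(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}}\,
e^{-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x} - \boldsymbol{\mu})},
\]

where $\boldsymbol{\mu} \in \mathbb{R}^d$ is the mean vector and $\Sigma \in \mathbb{R}^{d \times d}$ is the covariance matrix. We
can evaluate the expectations

\[
\mathbb{E}[\mathbf{x}] = \int d\mathbf{x}\ \mathbf{x} \cdot P(\mathbf{x})
\quad\text{and}\quad
\mathbb{E}[\mathbf{x}\mathbf{x}^T] = \int d\mathbf{x}\ \mathbf{x}\mathbf{x}^T \cdot P(\mathbf{x})
\]

by making a change of variables. Specifically, let $\Sigma = U\Lambda U^T$, and change
variables to $\mathbf{z} = \Lambda^{-1/2}U^T(\mathbf{x} - \boldsymbol{\mu})$. The Jacobian is $J = \Lambda^{-1/2}U^T$ with determinant
$|J| = |\Lambda|^{-1/2} = |\Sigma|^{-1/2}$. The integrals for the expectations then transform to

\[
\mathbb{E}[\mathbf{x}] = \int d\mathbf{z}\ (U\Lambda^{1/2}\mathbf{z} + \boldsymbol{\mu})\,
\frac{e^{-\frac{1}{2}\mathbf{z}^T\mathbf{z}}}{(2\pi)^{d/2}} = \boldsymbol{\mu},
\]
\[
\mathbb{E}[\mathbf{x}\mathbf{x}^T] = \int d\mathbf{z}\
\left(U\Lambda^{1/2}\mathbf{z}\mathbf{z}^T\Lambda^{1/2}U^T + U\Lambda^{1/2}\mathbf{z}\boldsymbol{\mu}^T
+ \boldsymbol{\mu}\mathbf{z}^T\Lambda^{1/2}U^T + \boldsymbol{\mu}\boldsymbol{\mu}^T\right)
\frac{e^{-\frac{1}{2}\mathbf{z}^T\mathbf{z}}}{(2\pi)^{d/2}} = \Sigma + \boldsymbol{\mu}\boldsymbol{\mu}^T.
\]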


Exercise B.7 


Show that the expressions for the expectations above do indeed transform to 
the integrals claimed in the formulae above. Carry through the integrations 
(filling in the necessary steps using standard techniques) to obtain the final 
results that are claimed above. 


B.6.2 Matrix Derivatives 


We now consider derivatives of functions of a matrix $X$. Let $q(X)$ be a scalar
function of a matrix $X$. The derivative $\partial q(X)/\partial X$ is a matrix of the same size
as $X$ with entries

\[
\left[\frac{\partial q(X)}{\partial X}\right]_{ij} = \frac{\partial q(X)}{\partial X_{ij}}.
\]

Similarly if a matrix $X$ is a function of a scalar $z$ then the derivative
$\partial X(z)/\partial z$ is a matrix of the same size as $X$ obtained by taking the derivative
of each entry of $X$:

\[
\left[\frac{\partial X(z)}{\partial z}\right]_{ij} = \frac{\partial X_{ij}(z)}{\partial z}.
\]

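The matrix derivatives of several interesting functions can be expressed in a
convenient form:

Function                                            $\partial/\partial X$
$\mathrm{trace}(AXB)$                               $A^TB^T$
$\mathrm{trace}(AXX^TB)$                            $A^TB^TX + BAX$
$\mathrm{trace}(X^TAX)$                             $(A + A^T)X$
$\mathrm{trace}(X^{-1}A)$                           $-(X^{-1}AX^{-1})^T$
$\mathrm{trace}(\theta(BX)A)$                       $B^T(\theta'(BX) \otimes A^T)$
$\mathrm{trace}(A\,\theta(BX)^T\theta(BX))$         $B^T(\theta'(BX) \otimes [\theta(BX)(A + A^T)])$
$\mathbf{a}^TX\mathbf{b}$                           $\mathbf{a}\mathbf{b}^T$
$|AXB|$                                             $|AXB|\,(X^T)^{-1}$
$\ln|X|$                                            $(X^T)^{-1}$

(In all cases, assume the matrix products are well defined and the argument to
the trace is a square matrix. When $A$ and $B$ have the same size, $A \otimes B$ denotes
component-wise multiplication, $[A \otimes B]_{ij} = A_{ij}B_{ij}$. A function applied to
a matrix denotes the application of the function to each element, $[\theta(A)]_{ij} =
\theta(A_{ij})$; $\theta'$ is the function which is the derivative of $\theta$.) We can also
get the derivatives of matrix functions of a parameter $z$:

Function                $\partial/\partial z$
$X^{-1}$                $-X^{-1}\,\frac{\partial X}{\partial z}\,X^{-1}$
$AXB$                   $A\,\frac{\partial X}{\partial z}\,B$
$XY$                    $X\,\frac{\partial Y}{\partial z} + \frac{\partial X}{\partial z}\,Y$
$\ln|X|$                $\mathrm{trace}\left(X^{-1}\frac{\partial X}{\partial z}\right)$

$X = X(z)$; $Y = Y(z)$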
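
Entries of the first table can be sanity-checked by perturbing one entry of X at a time. The sketch below checks two rows, trace(AXB) and ln|X|, using central differences; the helper name `numerical_matrix_derivative` and the matrices are assumptions made for this example (X is chosen with positive determinant so that the logarithm is defined).

```python
import numpy as np

def numerical_matrix_derivative(q, X, h=1e-6):
    """Central-difference approximation to dq/dX, one entry of X at a time."""
    D = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = h
            D[i, j] = (q(X + E) - q(X - E)) / (2 * h)
    return D

# Hypothetical square matrices; det(X) = 5 > 0.
A = np.array([[1.0, 2.0],
              [0.0, 1.0]])
B = np.array([[2.0, 0.0],
              [1.0, 3.0]])
X = np.array([[3.0, 1.0],
              [1.0, 2.0]])

# d/dX trace(AXB) should equal A^T B^T.
d_trace = numerical_matrix_derivative(lambda M: np.trace(A @ M @ B), X)
d_trace_formula = A.T @ B.T

# d/dX ln|X| should equal (X^T)^{-1}.
d_logdet = numerical_matrix_derivative(lambda M: np.log(np.linalg.det(M)), X)
d_logdet_formula = np.linalg.inv(X.T)
```

The same helper can be pointed at the other rows of the table; agreement up to the finite-difference error is a quick check on the formulas, not a proof.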